Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first, before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you
```shell
pip install -q datasets pandas
```

```shell
run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
```
🔴 The `run_benchmark_serving` call is missing `--use-chat-template`, which every other MTP benchmark script in the repo (6 out of 6) includes. Without this flag, MTP acceptance rates are artificially high because raw text without chat-formatting special tokens is easier for the draft model to predict, producing misleading benchmark results. Add `--use-chat-template` after the `--result-dir` line to match the established pattern.
Extended reasoning...
What the bug is
The new `qwen3.5_fp8_h200_mtp.sh` benchmark script omits `--use-chat-template` from its `run_benchmark_serving` call (lines 69-79). This flag is present in every other MTP benchmark script in the repository.
Evidence of the pattern
All 6 existing single-node MTP benchmark scripts include `--use-chat-template`:

- `dsr1_fp8_b200_mtp.sh` (line 108)
- `dsr1_fp4_b200_trt_mtp.sh` (line 133)
- `dsr1_fp8_b200_trt_mtp.sh` (line 143)
- `dsr1_fp8_h200_trt_mtp.sh` (line 115)
- `dsr1_fp4_mi355x_atom_mtp.sh` (line 71)
- `dsr1_fp8_mi355x_atom_mtp.sh` (line 70)
Additionally, the multi-node AMD utility (`bench.sh:60`) adds this flag generically for ALL MTP benchmarks via `[ "$IS_MTP" = "true" ] && echo "--use-chat-template"`, confirming this is a model-agnostic requirement, not DeepSeek-specific.
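As an illustrative sketch of that flag-injection pattern (this is not the actual `bench.sh` source; the surrounding assembly is hypothetical), the MTP-only flag is emitted by a conditional and spliced into the command line, so non-MTP runs get an empty string instead:

```shell
#!/bin/sh
# Sketch of the bench.sh pattern: emit "--use-chat-template" only for
# MTP benchmarks; the command substitution yields "" otherwise.
IS_MTP="true"

MTP_FLAGS="$([ "$IS_MTP" = "true" ] && echo "--use-chat-template")"

# A single invocation line then covers both MTP and non-MTP cases.
echo run_benchmark_serving --backend vllm $MTP_FLAGS
```

Leaving `$MTP_FLAGS` unquoted here is deliberate, so an empty value disappears from the argument list rather than passing an empty-string argument.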
Root cause
The script was likely copied from the non-MTP `qwen3.5_fp8_h200.sh` (which correctly omits the flag since MTP acceptance rates are irrelevant without speculative decoding) but failed to add `--use-chat-template` as all other MTP scripts do.
Step-by-step proof of impact
1. The benchmark runs with EAGLE speculative decoding enabled (`--speculative-algorithm EAGLE`, lines 55-58).
2. `run_benchmark_serving` sends prompts to the server. Without `--use-chat-template`, raw text is sent without chat-formatting special tokens.
3. The draft model finds raw text easier to predict than properly formatted chat messages (which contain special tokens like `<|im_start|>`, `<|im_end|>`, etc.).
4. This results in artificially higher MTP acceptance rates.
5. The benchmark reports misleadingly optimistic throughput numbers that won't reflect real-world chat serving performance.
How to fix
Add `--use-chat-template` to the `run_benchmark_serving` call, e.g. after `--result-dir /workspace/`. This is a one-line addition that aligns the script with every other MTP benchmark in the repository.
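For concreteness, a minimal sketch of the corrected invocation follows. The variable values are placeholders (the real script sets them elsewhere), and `run_benchmark_serving` is stubbed out so the sketch is self-contained; only the final `--use-chat-template` line is the actual change being requested.

```shell
#!/bin/sh
# Placeholder values; the real script derives these from its config.
MODEL="Qwen3.5-placeholder"; PORT=8000; ISL=1024; OSL=1024
RANDOM_RANGE_RATIO=0.8; CONC=64

# Stub for illustration only: prints each argument it receives.
run_benchmark_serving() { printf '%s\n' "$@"; }

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-dir /workspace/ \
    --use-chat-template
```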
```yaml
- qwen3.5-fp8-h200-sglang-mtp
  description:
    - "Add Qwen3.5-397B-A17B-FP8 H200 SGLang MTP (EAGLE speculative decoding)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
```
🟡 Nit: The `pr-link` for the new `qwen3.5-fp8-h200-sglang-mtp` entry uses a placeholder `/pull/XXX` instead of `/pull/921`. Please update before merging.
Extended reasoning...
Bug Description
The new perf-changelog entry added at line 987 for `qwen3.5-fp8-h200-sglang-mtp` uses a placeholder PR link:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

instead of the actual PR number:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
```

Code Path
The diff adds a new changelog block at the end of `perf-changelog.yaml` (lines 982-987). Every other finalized entry in the file has a concrete PR number in its `pr-link` field, making this entry an outlier that needs updating.
Pre-existing Context
There are several other pre-existing `XXX` placeholders in the file (e.g., for `glm5-fp8-mi355x-sglang`, `dsr1-fp8-h200-sglang`, `minimaxm2.5-fp8-h200-vllm`, `qwen3.5-bf16-mi325x-sglang`, `qwen3.5-fp8-mi325x-sglang`). However, those are from other PRs and outside the scope of this change. This PR should fix its own entry.
Impact
The impact is low — this is a metadata/documentation field, not functional code. The placeholder link would point to a nonexistent or incorrect pull request page, making it harder for someone reviewing the changelog to trace the entry back to its source PR.
Suggested Fix
Replace `XXX` with `921` on line 987:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/921
```

Given that the PR title is [WIP], this is likely a known TODO that the author plans to fix before final merge. Flagging it here as a reminder.
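If the placeholder is still in place at merge time, a one-line `sed` fix could apply it (the file path is an assumption; adjust to wherever `perf-changelog.yaml` actually lives — the sketch below runs against a temp copy so it is self-contained):

```shell
#!/bin/sh
# Hypothetical fix command, demonstrated on a temp copy of the file.
# Assumption: the changelog entry's pr-link line matches "/pull/XXX".
printf 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX\n' > /tmp/perf-changelog.yaml
sed -i 's|/pull/XXX|/pull/921|' /tmp/perf-changelog.yaml
cat /tmp/perf-changelog.yaml
```

Note that `sed -i` with no backup suffix is the GNU form; BSD/macOS `sed` would need `sed -i ''` instead.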